Project: Parallel Web Crawler
Section 1: Basic Functionality
Criteria | Meets Specification |
---|---|
Write code that passes all unit tests. | All unit tests must pass when the project's test suite is run. |
Write a crawler that successfully runs on real web pages (not just tests). | Running the crawler from the command line against a real configuration file should produce valid results. |
Respect the configured timeout for the parallel crawler. | The crawler should stop downloading new URLs after the configured "timeoutSeconds" is reached. The easiest way to test this is to configure a large "maxDepth" (for example, 10) and a small "timeoutSeconds" (for example, 1). The crawler should stop running after about 1 or 2 seconds. A minimal deadline-check sketch follows this table. |
Use a dynamic proxy to always record method invocation times for annotated methods. | Only record profile data for methods that carry the project's profiling annotation. |
Use a dynamic proxy to return the correct values and handle exceptions correctly. | The proxy must return the same values and throw the same exceptions as the wrapped object. Be sure not to accidentally throw an `UndeclaredThrowableException`; see the proxy sketch after this table. |
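The timeout check can be as simple as computing a deadline once and testing it before every new download. A minimal sketch, assuming the crawler is given a `Clock` and the configured timeout as a `Duration`; the class and method names below are illustrative, not part of the starter code:

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch: compute the deadline once, then stop scheduling new
// downloads as soon as the deadline has passed.
final class CrawlDeadline {
  private final Clock clock;
  private final Instant deadline;

  CrawlDeadline(Clock clock, Duration timeout) {
    this.clock = clock;
    this.deadline = clock.instant().plus(timeout); // timeout comes from "timeoutSeconds"
  }

  /** Returns true while the crawler is still allowed to fetch new URLs. */
  boolean hasTimeLeft() {
    return clock.instant().isBefore(deadline);
  }
}
```

Each subtask can call `hasTimeLeft()` before fetching another URL, so the crawl stops within roughly a second of the configured timeout.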
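For the dynamic proxy criteria, the key points are that only annotated methods are timed and that the wrapped object's original exception reaches the caller. A minimal `InvocationHandler` sketch, with the annotation named `Profiled` here only for illustration (your starter code's annotation and recording API may differ):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Stand-in for the project's profiling annotation (the real name may differ).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Profiled {}

// Illustrative InvocationHandler for the profiling proxy.
final class ProfilingMethodInterceptor implements InvocationHandler {
  private final Clock clock;
  private final Object delegate;

  ProfilingMethodInterceptor(Clock clock, Object delegate) {
    this.clock = clock;
    this.delegate = delegate;
  }

  @Override
  public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
    // Only time methods that carry the profiling annotation.
    boolean profiled = method.isAnnotationPresent(Profiled.class);
    Instant start = profiled ? clock.instant() : null;
    try {
      // Delegate to the wrapped object so return values pass through unchanged.
      return method.invoke(delegate, args);
    } catch (InvocationTargetException e) {
      // Rethrow the wrapped object's original exception. Rethrowing "e" itself
      // would surface to callers as an UndeclaredThrowableException.
      throw e.getTargetException();
    } finally {
      if (profiled) {
        Duration elapsed = Duration.between(start, clock.instant());
        // A real implementation records "elapsed" in the profiler's shared state;
        // printing here just keeps the sketch self-contained.
        System.out.println(method.getName() + " took " + elapsed);
      }
    }
  }
}
```

Unwrapping the `InvocationTargetException` in the `catch` block is what keeps callers from ever seeing an `UndeclaredThrowableException`.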
Section 2: Parallelism & Synchronization
Criteria | Meets Specification |
---|---|
Fetch and process pages from multiple threads running in parallel. | The crawler must fetch and process pages from multiple threads concurrently, implemented with one or more of the standard Java concurrency frameworks in `java.util.concurrent` (such as `ForkJoinPool` or an `ExecutorService`). Different threads must actually run in parallel: the solution is not allowed to use an executor with only one thread, and threads must not be synchronized in a way that makes them run serially. See the subtask sketch after this table. |
Correctly synchronize shared data structures to detect and avoid revisiting already seen URLs. | The crawler should avoid visiting the same web page multiple times. It should track which pages it has already visited, using an in-memory data structure, so that it does not re-crawl them. A URL is considered "visited" even if the HTTP response to that URL is an error, but revisits (if any) to that same URL should not count toward the final visited-page total in the crawl result. |
Analyze and reason about concurrent programming scenarios. | Questions Q1 and Q2 in the project's written questions file must be answered correctly. |
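One common way to satisfy both parallelism criteria is a `ForkJoinPool` whose subtasks share a thread-safe visited-URL set. A minimal sketch, with all class and field names invented for illustration:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Illustrative subtask: each task crawls one URL and forks child tasks for its links.
final class CrawlTask extends RecursiveAction {
  private final String url;
  private final int depth;
  private final Set<String> visitedUrls; // shared, thread-safe set

  CrawlTask(String url, int depth, Set<String> visitedUrls) {
    this.url = url;
    this.depth = depth;
    this.visitedUrls = visitedUrls;
  }

  @Override
  protected void compute() {
    // add() is atomic on a concurrent set: it returns false if another thread
    // already visited this URL, so each page is processed at most once.
    if (depth <= 0 || !visitedUrls.add(url)) {
      return;
    }
    // ... download and parse the page here (omitted) ...
    // For each discovered link, fork a child task at depth - 1 and join them:
    // invokeAll(links.stream()
    //     .map(link -> new CrawlTask(link, depth - 1, visitedUrls))
    //     .collect(Collectors.toList()));
  }

  public static void main(String[] args) {
    Set<String> visited = new ConcurrentSkipListSet<>();
    new ForkJoinPool().invoke(new CrawlTask("https://example.com", 2, visited));
  }
}
```

Checking `visitedUrls.add(url)` instead of a separate `contains()`-then-`add()` pair avoids the race condition in which two threads both decide to crawl the same URL.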
Section 3: File I/O
Criteria | Meets Specification |
---|---|
Correctly parse and load the crawler's JSON configuration. | The input configuration should be read using the project's JSON library (for example, Jackson's `ObjectMapper`). |
Correctly write the result to a file in the specified JSON format, which contains the number of pages visited and the top popular words. | The output should be written using the same JSON library used to read the configuration. |
Program the profiler to correctly write its data to a file or to standard output. | When opening input and output streams and writers, make sure to close them. Also, be sure not to close the same stream twice. See the JSON I/O sketch after this table. |
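A minimal sketch of the JSON reading and writing, assuming the project uses Jackson's `ObjectMapper` (the helper class below is illustrative, not part of the starter code). Try-with-resources closes each stream exactly once, and Jackson's auto-close features are disabled so the mapper does not close the same stream a second time:

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper: read a configuration object from JSON and write a result
// object back out, closing each stream exactly once.
final class JsonIo {

  // Jackson closes the underlying Reader/Writer by default; disabling that here
  // avoids double-closing when try-with-resources (or the caller) also closes it.
  private static final ObjectMapper MAPPER = new ObjectMapper()
      .disable(JsonParser.Feature.AUTO_CLOSE_SOURCE)
      .disable(JsonGenerator.Feature.AUTO_CLOSE_TARGET);

  static <T> T read(Path path, Class<T> type) throws Exception {
    try (Reader reader = Files.newBufferedReader(path)) {
      return MAPPER.readValue(reader, type);
    }
  }

  static void write(Writer writer, Object value) throws Exception {
    // The caller decides whether "writer" targets a file or standard output,
    // and is responsible for closing it exactly once.
    MAPPER.writeValue(writer, value);
  }
}
```

Passing a `Writer` into the write method keeps the same code path working for both a result file and standard output.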
Section 4: Code Design
Criteria | Meets Specification |
---|---|
Write a crawler that sorts and returns the correct word counts using only functional programming techniques such as the Java Stream API. | The word-count sorting code should avoid explicit loops and mutable intermediate state, relying instead on streams, comparators, and collectors. See the sorting sketch after this table. |
Make effective use of dependency injection and other design patterns. | Any parameters you add to the crawler's internal classes or subtasks should be supplied through their constructors (dependency injection) rather than through static or global state. If it makes sense for your design, you should apply the builder pattern and/or factory pattern to construct subtasks. |
Recognize design patterns and evaluate their effectiveness. | Questions Q3 and Q4 in the project's written questions file must be answered correctly. |
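A minimal sketch of a functional-style word-count sort, assuming the counts arrive as a `Map<String, Integer>`; the ordering shown here (highest count first, then alphabetical) is only an example, so follow whatever tie-breaking rules your starter code specifies:

```java
import java.util.LinkedHashMap;
import java.util.Map;

import static java.util.stream.Collectors.toMap;

// Illustrative functional-style sort: no loops, no mutation of the input map.
final class WordCountSorter {

  static Map<String, Integer> sort(Map<String, Integer> counts, int popularWordCount) {
    return counts.entrySet().stream()
        // Example ordering: highest count first, ties broken alphabetically.
        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
            .thenComparing(Map.Entry.comparingByKey()))
        .limit(popularWordCount)
        // LinkedHashMap preserves the sorted order in the returned map.
        .collect(toMap(Map.Entry::getKey, Map.Entry::getValue,
            (a, b) -> a, LinkedHashMap::new));
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = Map.of("crawler", 3, "java", 3, "the", 7, "web", 2);
    System.out.println(sort(counts, 3)); // {the=7, crawler=3, java=3}
  }
}
```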
Tips to make your project stand out:
- Respecting robots.txt files on crawled sites.
- Managing memory use by limiting the growth of the popular word count and profiling data structures, while not compromising accuracy.
- Throttling HTTP requests (e.g., per domain) in order to not overwhelm crawled servers.